Chapter 4

Question 1, 2 and 3

dataset description:

The Boston research seems to look for correlations between the standard of living, housing quality, and other criteria such as the urban environment, accessibility, and demographics within the Boston area.

Each row represents data for a specific town and has information about various factors such as crime rate, proportion of non-retail business acres, property tax rates, pupil-teacher ratios, etc.

This data frame contains the following columns:

  • crim - per capita crime rate by town.

  • zn - proportion of residential land zoned for lots over 25,000 sq.ft.

  • indus - proportion of non-retail business acres per town.

  • chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • nox - nitrogen oxides concentration (parts per 10 million).

  • rm - average number of rooms per dwelling.

  • age - proportion of owner-occupied units built prior to 1940.

  • dis - weighted mean of distances to five Boston employment centres.

  • rad - index of accessibility to radial highways.

  • tax - full-value property-tax rate per $10,000.

  • ptratio - pupil-teacher ratio by town.

  • black - 1000(Bk−0.63) square Bk is the proportion of blacks by town.

  • lstat - lower status of the population (percent).

  • medv - median value of owner-occupied homes in $1000s.

Data analyzes:

Explore the structure and the dimensions of the dataset. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, the distributions of the variables and the relationships between them.

library(MASS)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidyr)

setwd("/Users/margot/Desktop/Desktop - MacBook Pro de MARGOT/Open data with R 2023/IODS-project")

Boston <- Boston
# explore the mean, median, min, max per variable
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# 506 entries with 14 variables
dim(Boston)
## [1] 506  14
# Data type: all num or int
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# let's create points plot for each variable. -> not easy to read as too many variables
pairs(Boston)

# Let's analyze the histogram for each variable. To use the key facet wrap that will draw one chart per key, we need to have a table with 2 columns: the keys which are all the variables and then the metric. 
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Let's compute a correlation matrix
matrix_correlation_var <- cor(Boston)

# Visualize correlation matrix as a heatmap
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
ggplot(data = melt(matrix_correlation_var), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0)

according to the matrix - Negative relationships between certain variables:

lstat and medv - lower status of population and the median price of homes owned by occupants

lstat and rm - lower status of population and the average of room numbers per home

tax and medv - the percentage at which a property is taxed based on its assessed value and the the median price of homes owned by occupants

dis and lstat - the weighted average distance from each house to these employment centers and the lower status of population

dis and age - the weighted average distance from each house to these employment centers and the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940.

dis and nox - the weighted average distance from each house to these employment centers and nitrogen oxides concentration.

dis and indus - the weighted average distance from each house to these employment centers and the proportion of non-retail business acres per town (manufacturing facilities, industrial parks, office spaces, warehouses etc).

tax and dis - the percentage at which a property is taxed based on its assessed value and the weighted average distance from each house to these employment centers

zn and age - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940

zn and nox - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the nitrogen oxides concentration.

zn and indus - the percentage or fraction of land within residential areas that is for individual lots with a minimum size of over 25,000 square feet and the proportion of non-retail business acres per town (manufacturing facilities, industrial parks, office spaces, warehouses etc).

Positive relationships between variables

medv and rm - the median price of homes owned by occupants and the average of room numbers per home

tax and indus - the percentage at which a property is taxed based on its assessed value and the proportion of non-retail business acres per town.

tax and nox - the percentage at which a property is taxed based on its assessed value and the nitrogen oxides concentration.

age and indus - the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940 and the proportion of non-retail business acres per town.

age and nox- the percentage or fraction of homes that are owner-occupied and were constructed before the year 1940 and the nitrogen oxides concentration.

nox and indus - the nitrogen oxides concentration and the proportion of non-retail business acres per town.

–> tests in my model: sltat, medv, rm, tax, dis, age, nox, indus, zn

Test some models based on the dataset and observed correlation

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
model1 <-  glm(lstat ~ medv + dis + rm,data=Boston)
model2 <-  glm(medv ~ rm + tax + lstat ,data=Boston)

# all the VIF are between 1 and 2.1 which is ok. It suggest a low multicollinearity and imply that the variance of the estimated coefficients is not significantly inflated due to collinearity
vif(model1)
##     medv      dis       rm 
## 1.982257 1.068810 1.940168
vif(model2)
##       rm      tax    lstat 
## 1.610953 1.426005 2.092901
#let’s calculate the corresponding ODDS ratios and confidence intervals (95%):
OR1 <- coef(model1) %>% exp
OR2 <- coef(model2) %>% exp
CI1 <- confint(model1) %>% exp
## Waiting for profiling to be done...
CI2 <- confint(model2) %>% exp
## Waiting for profiling to be done...
# the confidence interval for an odds ratio doesn't span 1 = there's a statistically significant effect for both models
cbind(OR1, CI1) 
##                      OR1        2.5 %       97.5 %
## (Intercept) 1.764803e+16 4.023272e+14 7.741285e+17
## medv        6.606608e-01 6.250484e-01 6.983022e-01
## dis         3.292580e-01 2.756488e-01 3.932934e-01
## rm          1.682751e-01 8.210665e-02 3.448749e-01
cbind(OR2, CI2)
##                     OR2        2.5 %      97.5 %
## (Intercept)   0.6073487  0.001289619 286.0319984
## rm          181.1868455 76.546116788 428.8744403
## tax           0.9935198  0.990167804   0.9968832
## lstat         0.5754723  0.522466602   0.6338557
# the residual deviance is way smaller than the null deviance. It indicates a reasonably good fit of the model to the data.
summary(model1)
## 
## Call:
## glm(formula = lstat ~ medv + dis + rm, data = Boston)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.40940    1.92918  19.391  < 2e-16 ***
## medv        -0.41451    0.02827 -14.662  < 2e-16 ***
## dis         -1.11091    0.09067 -12.252  < 2e-16 ***
## rm          -1.78215    0.36612  -4.868 1.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 17.22406)
## 
##     Null deviance: 25752.4  on 505  degrees of freedom
## Residual deviance:  8646.5  on 502  degrees of freedom
## AIC: 2882.2
## 
## Number of Fisher Scoring iterations: 2
# medv and rm also influences each other, so let's modify the model a bit
model11 <-  glm(lstat ~ medv + dis + rm + rm * medv ,data=Boston)
summary(model11)
## 
## Call:
## glm(formula = lstat ~ medv + dis + rm + rm * medv, data = Boston)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 74.41052    3.64466   20.42   <2e-16 ***
## medv        -1.94316    0.13517  -14.38   <2e-16 ***
## dis         -0.89569    0.08285  -10.81   <2e-16 ***
## rm          -7.55541    0.59817  -12.63   <2e-16 ***
## medv:rm      0.22526    0.01957   11.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 13.64918)
## 
##     Null deviance: 25752.4  on 505  degrees of freedom
## Residual deviance:  6838.2  on 501  degrees of freedom
## AIC: 2765.5
## 
## Number of Fisher Scoring iterations: 2
# same for this one, residual deviance is way smaller than the null deviance. 
summary(model2)
## 
## Call:
## glm(formula = medv ~ rm + tax + lstat, data = Boston)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.498652   3.140239  -0.159 0.873895    
## rm           5.199529   0.439618  11.827  < 2e-16 ***
## tax         -0.006501   0.001724  -3.770 0.000182 ***
## lstat       -0.552564   0.049302 -11.208  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 29.90866)
## 
##     Null deviance: 42716  on 505  degrees of freedom
## Residual deviance: 15014  on 502  degrees of freedom
## AIC: 3161.4
## 
## Number of Fisher Scoring iterations: 2

Question 4

Standardizing the data. Variables often have different scales and units, making direct comparisons challenging. Standardization brings all variables to a common scale, allowing for fair comparisons between different variables. It makes the distribution of each variable more consistent, with a mean of 0 and a standard deviation of 1. This normalization aids in interpreting coefficients and comparing the relative importance of different predictors in regression. Standardizing ensures that each variable contributes equally to these techniques, preventing one variable from dominating the analysis due to its scale. It allows easier comparison of the magnitude of the effect of each variable on the outcome. Finally it can mitigate issues related to multicollinearity in regression analysis by putting variables on a similar scale, reducing the impact of differing scales on regression coefficients.

# select numerical values
Boston_numeric_cols <- Boston[, sapply(Boston, is.numeric)]

# The scale() to standardize and transform the data to have a mean of 0 and a standard deviation of 1.
scaled_boston <- scale(Boston_numeric_cols)

# convert to a data frame
scaled_table_boston <- as.data.frame(scaled_boston)

# how did the data change? Mean is now 0 so it has worked;
summary(scaled_table_boston)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# use the cut function to create categorical variables based on intervals or breaks in a numerical variable. We do this process for the crim variable from 0 to 0.25 to 0.50 to 0.75 to 1 (quantiles). Add labels for each category. 
# include.lowest = TRUE is to ensure there is no NA category.
quantiles <- quantile(Boston$crim, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
interval_labels <- c("premier_quantil", "second_quantil", "third_quantil", "fourth_quantil")  

scaled_table_boston$quantiles_crime <- cut(Boston$crim, quantiles, labels= interval_labels,include.lowest = TRUE)
# Notes: Quantiles derived from numeric values can be considered categorical or continuous. When quantiles represent discrete categories or bins that divide a continuous variable into distinct groups, they are treated as categorical (e. small, medium, big). If quantiles represent numeric values that indicate the position or value relative to the distribution of a continuous variable (e.g., the 25th percentile, median, 75th percentile), they are considered continuous.

# drop the former column crim and create a new table. 
Boston_new <- scaled_table_boston %>% dplyr::select(-crim)

# We need 80% of the rows from total rows
train_size <- round(0.8 * nrow(Boston_new))

# Select a sample randomly among the dataset 80%
train_set <- sample(seq_len(nrow(Boston_new)), size = train_size)

# Create training and testing subsets
train_data <- Boston_new[train_set, ]
test_data <- Boston_new[-train_set, ]

Question 5:

Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.

Linear Discriminant Analysis (LDA) can be quite useful in various real-life scenarios where classification or dimensionality reduction is required such as: Marketing and Customer Segmentation, product quality control, credit scoring.

It is ideal when we have multiple variables that can help us categorize each entry in one specific group where all others will have similar mean, and variance. The goal is to predict the categorical outcome or class based on a set of predictors.

LDA transforms the original variables into a smaller set of new variables, known as linear discriminants. These new variables (linear discriminants) are created in a way that maximizes the separation between different categories or classes in the data.By using the information captured in the linear discriminants, LDA helps in accurately assigning new observations (data points) to their respective categories or classes. LDA takes variables, combines them in a smart way to create new variables that are helpful for understanding differences between groups or categories, and then uses this understanding to categorize or classify new data based on what it has learned. This makes it useful for both simplifying data and making predictions about categories in that data.

LDA calculates the mean and variance for each predictor within each class or category in the target variable. It finds coefficients for these combinations by considering differences in means between classes and assumes equal variances for each predictor within classes. It assumes that the data approximately follows a normal distribution.

# check the impact of each variable on a categorical variable (quantiles_crime)
# each quantiles are approximately equal 25%
# LD1 explains 96% of the model. When LD1 captures 96% of the variance, it suggests that LD1 effectively separates the classes or groups in the data based on their mean differences and variance. Within LD1, data points are grouped together in a way that maximizes the separation between classes while minimizing the variation within each class.
# Positive coefficients indicate that higher values of that variable contribute to a higher score in that particular discriminant, while negative coefficients suggest the opposite. The larger the magnitude of the coefficient, the greater the impact of that variable on the discriminant.

library(MASS)

# Fit LDA on the train set - LDA is a statistical method used for classification. It fits an LDA model using the train_data, where quantiles_crime is the target variable to predict based on the other variables
lda_model <- lda(quantiles_crime ~ ., data = train_data)

# Predict on the test set by using the function predict to apply the model created from the train_data to test_data -> the idea here is to predict the class labels for this test_data
lda_pred <- predict(lda_model, test_data)$class # class extract the predicted class labels

# actual crime quantiles
actual_crime_categories <- test_data$quantiles_crime

# Extract LD scores from the LDA model's predictions
lda_scores <- predict(lda_model, test_data)$x # x accesses the matrix of posterior probabilities or scores associated with each class for the observations in the test_data.

# Create a dataframe with LD scores from first 2 columns of the lda_score and the predicted classes. Combining LD1, LD2, and the predicted classes for visualization. In many cases, visualizing beyond LD1 and LD2 might become complex to display effectively in two-dimensional plots. LD1 and LD2 are typically chosen for visualization as they capture the most discrimination power while allowing for a clearer visualization on a 2D plot.
plot_data <- data.frame(LD1 = lda_scores[, 1], LD2 = lda_scores[, 2], prediction_crime_quantile = as.factor(lda_pred))

plot_data
##            LD1         LD2 prediction_crime_quantile
## 2   -3.5745221 -0.64451057            second_quantil
## 5   -3.4220789 -0.86983808            second_quantil
## 12  -2.2778287 -0.51988976            second_quantil
## 26  -1.9948071 -0.72504745             third_quantil
## 30  -2.2250252 -0.94166175             third_quantil
## 32  -2.0697184 -0.74021870             third_quantil
## 34  -1.9327759 -0.71202182             third_quantil
## 38  -2.5257378  0.12552915            second_quantil
## 41  -3.6883661  2.74301374           premier_quantil
## 50  -3.2218646 -0.48961161            second_quantil
## 54  -3.4886716  0.80145191            second_quantil
## 56  -2.9561637  2.99590204           premier_quantil
## 61  -1.0952529  0.88885262            second_quantil
## 64  -1.5122987  0.75608005            second_quantil
## 68  -3.4640040  0.86864157            second_quantil
## 74  -3.6071979  0.47262937            second_quantil
## 84  -3.0537943  1.12073084           premier_quantil
## 86  -3.4136557 -0.16879475            second_quantil
## 87  -3.3812279 -0.01841781            second_quantil
## 96  -3.7141816 -0.14565633            second_quantil
## 98  -3.6625942 -0.82032159            second_quantil
## 102 -2.0484837 -0.47389005             third_quantil
## 105 -1.7267137 -0.38899916             third_quantil
## 108 -1.7101454 -0.36138498            second_quantil
## 110 -1.6657038 -0.48652102             third_quantil
## 119 -1.4709537 -0.49855170            second_quantil
## 120 -1.6909690 -0.33211341            second_quantil
## 144 -0.7799700 -3.37780537             third_quantil
## 146 -0.5299011 -3.41382339             third_quantil
## 148 -0.6915666 -3.37275532             third_quantil
## 151 -1.0529042 -3.33615241             third_quantil
## 152 -0.9182570 -3.22561943             third_quantil
## 153 -1.4552906 -2.85225046             third_quantil
## 154 -0.8056967 -3.31488559             third_quantil
## 156 -1.2247542 -3.03251151             third_quantil
## 160 -1.1911639 -3.28775772             third_quantil
## 162 -1.8038519 -2.65243794             third_quantil
## 164 -2.2509649 -2.80118899             third_quantil
## 170 -1.7856269 -1.58157721             third_quantil
## 173 -1.8835851 -0.44340913            second_quantil
## 177 -2.5518759 -0.05398363            second_quantil
## 178 -2.3534652 -0.24587312            second_quantil
## 180 -3.1799123 -0.65641047            second_quantil
## 186 -2.8931501 -0.67165745            second_quantil
## 187 -3.1694365 -1.34082547             third_quantil
## 191 -3.0641996  1.33136045           premier_quantil
## 195 -4.7564356  2.22622507           premier_quantil
## 200 -3.8161901  3.30031073           premier_quantil
## 201 -3.8712572  3.41421853           premier_quantil
## 203 -4.3220368  2.46267508           premier_quantil
## 212 -2.4225096 -0.91434081            second_quantil
## 215 -2.6841155 -0.49241490            second_quantil
## 222 -0.9425802 -0.46955637            second_quantil
## 223 -1.3621278 -0.33403365            second_quantil
## 227 -1.0571748 -0.72562941             third_quantil
## 230 -1.6589978  0.25130640            second_quantil
## 240 -2.5317213  1.35559501           premier_quantil
## 244 -2.9545547  1.61905871           premier_quantil
## 246 -1.4500940  0.49908421            second_quantil
## 256 -4.7247043  2.82783674           premier_quantil
## 257 -3.5775130  2.60475285           premier_quantil
## 261 -1.9236104 -0.98261239             third_quantil
## 263 -1.8202901 -1.81293052             third_quantil
## 264 -1.8071696 -1.01562590             third_quantil
## 269 -2.4822814 -0.63165734             third_quantil
## 270 -3.3887611  0.48116633            second_quantil
## 281 -2.6231069 -0.25079024            second_quantil
## 283 -3.1371389 -0.20779963            second_quantil
## 291 -3.1506741  3.29066633           premier_quantil
## 296 -3.5450102 -0.44912043            second_quantil
## 297 -3.2952716 -0.60718224            second_quantil
## 300 -3.4078611  2.85303745           premier_quantil
## 302 -2.1192646  1.66623895           premier_quantil
## 307 -1.4023689  1.04675898           premier_quantil
## 320 -2.6227967 -0.74874946            second_quantil
## 322 -2.4787364 -0.17664420            second_quantil
## 323 -2.5139839 -0.03705061            second_quantil
## 328 -2.4786305 -0.33913299            second_quantil
## 335 -2.6642772 -0.28091928            second_quantil
## 337 -2.4212578 -0.17764528            second_quantil
## 341 -2.3037193 -0.17933103            second_quantil
## 348 -3.3059511  3.04292152           premier_quantil
## 358  6.0641561 -0.65748797            fourth_quantil
## 364  6.1629359 -0.32157529            fourth_quantil
## 366  6.7268291 -0.15621779            fourth_quantil
## 375  7.0691112 -0.10099902            fourth_quantil
## 392  6.3849977 -0.26255291            fourth_quantil
## 397  6.2988380  0.26248748            fourth_quantil
## 405  6.5247026  0.34647943            fourth_quantil
## 416  6.9956270  0.15506546            fourth_quantil
## 419  6.9227909  0.34631672            fourth_quantil
## 434  6.6417870  0.01626303            fourth_quantil
## 435  6.6924080  0.13295734            fourth_quantil
## 444  6.4871272 -0.28889484            fourth_quantil
## 461  6.4142022 -0.12659301            fourth_quantil
## 471  5.9122264  0.65011128            fourth_quantil
## 473  5.7946575  0.64395448            fourth_quantil
## 477  6.0921746  0.52331548            fourth_quantil
## 480  6.0559341  0.56822066            fourth_quantil
## 491 -1.3982813 -1.61997283            second_quantil
## 499 -1.5148751 -0.60642015             third_quantil
# Create a scatterplot of LD1 and LD2
plot_LDA <- ggplot(plot_data, aes(x = LD1, y = LD2, color = prediction_crime_quantile)) +
  geom_point() +
  labs(title = "LDA Biplot with Predicted Crime Quantiles")

plot_LDA

# adding real values - comparison of actual vs predicted values in test_data
realVsPred_plot <- plot_LDA + 
  geom_point(aes(color = actual_crime_categories), size = 4, alpha = 0.1) +
  labs(color = "Real Quantiles of Crime")

realVsPred_plot

# the accuracy of predictions using test data
accuracy <- mean(lda_pred == test_data$quantiles_crime)
print(paste("Accuracy of LDA model on test data:", round(accuracy * 100, 2), "%"))
## [1] "Accuracy of LDA model on test data: 67.33 %"

Question 6:

Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. (0-3 points)

# save the crime data: 
actual_crime_categories <- test_data$quantiles_crime

# Remove the categorical crime variable from the test dataset
test_data_without_crime <- test_data %>% dplyr::select(-quantiles_crime)

# get the classes based on the model - this was done earlier with lda_pred. so I am a bit confused.
lda_pred_test <- predict(lda_model, test_data_without_crime)$class

# get the table with the prediction vs actual - results are same between the 2 ways since I did 2 times the same steps. I might have missed something in the requests.
cross_tab <- table(Predicted = lda_pred_test, Actual = actual_crime_categories)
cross_tab
##                  Actual
## Predicted         premier_quantil second_quantil third_quantil fourth_quantil
##   premier_quantil              14              3             0              0
##   second_quantil               17             17             7              0
##   third_quantil                 1              3            21              1
##   fourth_quantil                0              0             1             16
cross_table <- table(Predicted = lda_pred, Actual = actual_crime_categories)
cross_table
##                  Actual
## Predicted         premier_quantil second_quantil third_quantil fourth_quantil
##   premier_quantil              14              3             0              0
##   second_quantil               17             17             7              0
##   third_quantil                 1              3            21              1
##   fourth_quantil                0              0             1             16

Question 7:

Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)

# Boston is reload
Boston_load <- Boston

# scale of Boston
Boston_scaled <- scale(Boston_load)

# Calculate distance betwwen scaled data points: 
distances <- dist(Boston_scaled)

# Visualize the distances using fviz_dist()
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_dist(distances, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

# k mean: The k-means algorithm groups observations into clusters based on their similarity, aiming to minimize the variation within clusters while maximizing the variation between clusters. We need to define the number of clusters we need to do the kmean analyse. 

# Elbow Method to find optimal number of clusters: Calculate the Within-Cluster-Sum of Squared Errors (WSS) for different values of k, and choose the k for which WSS becomes first starts to diminish. In the plot of WSS-versus-k, this is visible as an elbow.
# The Squared Error for each point is the square of the distance of the point from its representation i.e. its predicted cluster center.
# The WSS score is the sum of these Squared Errors for all the points.
wss <- numeric(10)  # Initialize within-cluster sum of squares vector
for (i in 1:10) {
  kmeans_model <- kmeans(Boston_scaled, centers = i, nstart = 10)
  wss[i] <- kmeans_model$tot.withinss  # Store within-cluster sum of squares
}

# Plotting the Elbow Method
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within-cluster Sum of Squares", main = "Elbow Method")

# or we can use the direct function for the Elbow method
kmeans_model_optimal2 <- fviz_nbclust(Boston_scaled, kmeans, method = "wss")
# or we can use the silhuette method
fviz_nbclust(Boston_scaled, kmeans, method = "silhouette")

# Visualize clusters using pairs() or ggpairs()
pairs(Boston_scaled, col = kmeans_model$cluster)

k <- 2  # 2 seems to be the best option according to the Elbow Method and the silhouette method 

#Kmean model:
kmeans_model <- kmeans(Boston_scaled, centers = k, nstart = 25)

cluster_assignments <- kmeans_model$cluster

# visualize the clusters thanks to fviz_cluster function
fviz_cluster(kmeans_model, data = Boston_scaled)

library(GGally)

# Combine the scaled data with the cluster assignments (cluster 1 or 2)
clustered_data <- cbind(as.data.frame(Boston_scaled), Cluster = as.factor(cluster_assignments))

# Visualize clusters using ggpairs()
ggpairs(clustered_data, aes(color = Cluster))
## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Mean for each cluster and variable:
clustered_data %>%
  mutate(Cluster = kmeans_model$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
## # A tibble: 2 × 15
##   Cluster   crim     zn  indus     chas    nox     rm    age    dis    rad
##     <int>  <dbl>  <dbl>  <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1       1 -0.389  0.262 -0.615  0.00291 -0.582  0.245 -0.433  0.454 -0.583
## 2       2  0.724 -0.487  1.14  -0.00541  1.08  -0.455  0.805 -0.844  1.08 
## # ℹ 5 more variables: tax <dbl>, ptratio <dbl>, black <dbl>, lstat <dbl>,
## #   medv <dbl>
model_predictors <- train_data %>% dplyr::select(-quantiles_crime)

# check the dimensions
dim(model_predictors)
## [1] 405  13
dim(lda_model$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_model$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')